Towards an automated data cleaning with deep learning in CRESST

The CRESST experiment employs cryogenic calorimeters for the sensitive measurement of nuclear recoils induced by dark matter particles. The recorded signals need to undergo a careful cleaning process to avoid wrongly reconstructed recoil energies caused by pile-up and read-out artefacts. We frame this process as a time series classification task and propose to automate it with neural networks. With a data set of over one million labeled records from 68 detectors, recorded between 2013 and 2019 by CRESST, we test the capability of four commonly used neural network architectures to learn the data cleaning task. Our best performing model achieves a balanced accuracy of 0.932 on our test set. We show on an exemplary detector that about half of the wrongly predicted events are in fact wrongly labeled events, and a large share of the remaining ones have a context-dependent ground truth. We furthermore evaluate the recall and selectivity of our classifiers with simulated data. The results confirm that the trained classifiers are well suited for the data cleaning task.


Introduction
Dark Matter (DM) particles are hypothetical particles beyond the Standard Model of particle physics (SM) and thought to make up (83.9 ± 1.5)% of all matter in our universe [1]. The experimental search for particle DM inspired many experiments in the past decades. The Cryogenic Rare Event Search with Superconducting Thermometers (CRESST), located in the Laboratori Nazionali del Gran Sasso, is a direct DM search and uses scintillating, cryogenic calorimeters as targets. This technology achieves a) low nuclear recoil thresholds, with a currently lowest reported value of 10 eV [2]; b) currently the strongest exclusion limits on spin-independent (spin-dependent) DM-SM interactions for DM masses in the range 0.16-1.5 (0.25-1.5) GeV/c 2 , under standard assumptions [3,4]. However, the sensitivity of the detectors and readout electronics cause not only particle recoils to trigger, but also a variety of artefacts: spikes, drifts, jumps and glitches in the noise baseline (BL) from the readout and heater electronics, and piled up pulse shapes (PSs) from multiple particle recoils in close temporal proximity. The recorded signals therefore need a cleaning step, before a meaningful data analysis can be started.
The standard approach is based on the calculation of PS and BL features, as e.g. the pulse height (PH), rise and decay time and BL slope. The cleaning is then carried out by an analyst, who defines individual rejection regions (cuts) in the space of the calculated a e-mail: felix.wagner@oeaw.ac.at (corresponding author) Fig. 1 Particle recoils produce a pulse-shaped record (blue). Flux quantum losses of the SQUID amplifier in the read-out circuit are caused by fast magnetic field changes, e.g. from high energy recoils (orange). Decaying BLs are residuals from earlier high energy pulses (green). Pile-up originates from multiple particle recoils within the same record window (red) features for each detector. Automating this process brings two benefits: a) detector setups with a large number of simultaneously operated detectors require less human effort to clean and analyse the recorded data and b) it helps preventing biases from individual decisions of the analyst made in the manual intervention.
In each detector a particle recoil produces a characteristic PS, determined by the thermal properties of the target and sensor. Artefacts usually deviate from this characteristic PS (see Fig. 1). One common multi-purpose data cleaning method is to fit the numerical array of the characteristic PS to each record and reject all those, whose fit error exceed a certain value. This method requires a relatively high computational cost for each record, either the prior knowledge of the detector-specific PS or dedicated training data from the same detector to build a template, and manual interventions in tuning the cut values. Furthermore, the discrimination power of cuts is often limited by overlaps of the artefact and target recoil feature distributions, where only correlated cuts on multiple features could achieve an optimal cleaning of the data.
In our work, we create a data set with samples from measurements that were done in CRESST-II and CRESST-III. We clean the data for each detector individually, and we label each record as accepted (positive) or rejected (negative). The data cleaning task is now equivalent to a binary, supervised time series classification problem, i.e. learning to discriminate between positive and negative records. We approach this task with deep neural networks, with which promising results were obtained for time series classification tasks [5].
Similar supervised deep learning approaches were successful in discriminating between individual PSs, originating from different types of particle recoils [6][7][8], or recoils in different positions and components of the detector [9][10][11][12]. The discrimination and reconstruction of pile-up artifacts was studied in Refs. [13,14]. The task of general data cleaning for cryogenic calorimeters with autoencoders (AEs) and variational AEs (VAEs) was studied in Refs. [12,15], and with a Principal Component Analysis (PCA) in Ref. [16]. Differently to those approaches, we do not rely on previous knowledge about the detector to which our method is applied, or the tuning of a cut value. We explore the synergies with the PCA method from Ref. [16] in Sect. 4.3, by combining our approaches. A method for the supervised discrimination between pulses and artifacts with neural networks was proposed in Ref. [17], and shown on purely simulated data. We train and verify with both measured and simulated data, which adds necessary reliability to the results.
Our work was organized as follows: first, we create a large-statistics data set of labeled events. The procedure is described in Sect. 2. Second, we show that neural networks can learn the data cleaning task. The performance of our chosen models on the training set is presented in Sect. 3. Finally, we bridge the gap between triggering and the parts of the event selection which require detector-specific knowledge or tuning. In Sect. 4, we apply the trained models to a test set or recorded data and to simulated positive and negative events.

Used data and pre-processing
The CRESST detector concept is based on a multi-channel read out. Within a joint housing (a detector module), a phonon and a light detector measure the produced phonon population and scintillation light from a particle recoil inside a scintillating crystal (the target). The detector response is measured with a Transition Edge Sensor (TES), operated in a read-out circuit with a SQUID amplifier. For a detailed description of TES we refer to Refs. [18,19]. In some detector modules, additional TES serve to veto energy depositions in the housing or holding structure. In the context of this work, we treat all of them as individual detectors. We create our data set by extracting samples of several tens of hours measurement time from all measurements that were done between 2013 and 2019 in the CRESST setup at LNGS, for a total of 68 detectors and more than one million records. We exclusively use data from the e − /γ and neutron-calibration runs. These are dedicated measurement campaigns before or after the data for dark matter searches was taken. During the e − /γ calibration, a Co-57 source is used to produce a characteristic calibration line in the energy spectrum of the detector. During the neutron calibration, an AmBe source is used to produce a strong neutron flux and calibrate the response of the detectors for particles with no electromagnetic charge. By only using calibration data in this work we additionally prevent biasing effects for a potential application of the trained models to physics searches in future work. We chose seven detectors as a test set, these make up three detector modules: two times a target and one veto detector in a joint housing, and once a target and two veto detectors. These detectors were used only to evaluate the selection of the trained models, reported in 4. For the optimization of hyperparameters of the classifiers and fit process, we split five percent of the data from the remaining 61 detectors into an individual validation set. The total sizes of the data sets are summarized in Table 1. The data preparation and cuts are done with the Python package 'Cait' [17].
We want our classifier to successfully perform these operations: -reject all jump, drift, spike, glitch and pile-up artifacts that deviate significantly from a recoil-type PS, -reject all PSs that rise far away from the trigger position at 1/4 of the record window or do not decay within the window, -let all PSs survive that fit the above criteria, not only those from target recoils, and also if they show saturation effects typical for high energy recoils, -let empty noise traces survive if their slope is within the typical slope of noise traces for the corresponding detector.
We apply to the triggered and noise events cuts on the PS parameters, to create the desired positive and negative labels for the classifier training. The used PS parameters include the PH, onset, rise and decay times of pulses, the difference between the offset levels on the left and right side of the record, the maximal and minimal derivatives and their positions within the record, and the mean, variance and skewness of the values in the record. We define rejection regions with the rectangle and lasso selection tools of Cait's VizTool, as described in Ref. [17].
However, the applied cuts are imperfect, and generally tend to reject more pulses than necessary, which introduces a small share of wrongly labeled records in our data set. Their impact on the trained classifiers and reported performance metrics is studied in Sect. 4.1. The training set consists to 83.6% of positive records. We counter this imbalance with data augmentation, explained later in this section, and a weighted loss function, explained in Sect. 3.
We applied several transformations to the records as a preprocessing. The recordings were initially made with a sampling frequency of 25 kHz and record window lengths of 8192 or 16384 samples, depending on the measurement campaign (run). First, we downsampled the records to the length of 512 samples. The values in the downsampled time series correspond therefore to a time interval of 0.64 or 1.28 ms, while those in the original records corresponded to 40 mus. This time resolution and record length is sufficient to distinguish pulses from artefacts, and the step significantly reduces the necessary computing power for optimization. In a second step, we scale the values within all records individually, such that every record's minimal value is 0 and maximal value is 1. The information of the amplitude of individual records is, without further information about the corresponding detector, arbitrary and all useful information is contained in the shape of the time series.
The classifier optimization is an iterative process, for which the training set is split in so-called mini-batches of 64 records each. Three potential biases were identified in our data set, and their impact was mitigated by data augmentation methods: (a) while we have an over-density of positive records, we have only a relatively small amount of pile-up artefacts. However, they are the class of artifacts that is most difficult to clean from the data with standard cuts. We therefore randomly pick several of the positive records in each mini-batch and superpose them with time-shifted copies of themselves, to artificially create pile-up events. The shifts are chosen such that the superposed pulse appears at a random position within the record window. We then changed their label from positive to negative. (b) The data set contains many records with relatively high Signal-to-Noise Ratio (SNR), because most recorded pulses originate from calibration sources with typical energy depositions far above the detection threshold. A common method to robustify neural networks against small numerical deviations is to randomly add Gaussian noise to the inputs. We apply that method and by that artificially create records with lower SNR. (c) The record window is built such that the trigger time is at 1/4 of its length. While large deviations in the pulse onset from this position are interpreted as artefacts and rejected, pulses with small deviations, of the order of milliseconds, should still be accepted. To implement this objective in our data set, we randomly shift all records by up to 26 samples (33.28 or 16.64 ms). The augmentations are applied to a record when it is drawn from the training set. Augmentation (c) is always applied, but (a) and (b) only with a probability of 20%.
A mini-batch of 64 records from the training set is shown in Fig. 2.

Models and training process
A neural network classifier is a function f w , which is parameterized with so-called weights w ∈ R M , that maps an element x from a data set D ⊂ R N to a predictionŷ ∈ (0, 1). A lower (higher) value indicates a higher belief that x corresponds to a negative (positive) label y ∈ {0, 1}. The number N corresponds to the dimensionality of the data, namely the number of samples in the record  [20]. Furthermore, while the computational cost of many other function approximators increases exponentially with their number of parameters, neural networks do not suffer from this curse of dimensionality. These properties make them useful for the classification of high dimensional and strongly correlated data, which is the case for our raw sensor signals. For a high-quality pedagogical introduction to neural networks and machine learning we recommend Ref. [21].
For this work, we used four different models: -A small Convolutional Neural Network (CNN). This model applies a set of filters to the input, to extract meaningful features. It is highly efficient for spacially correlated data, e.g. for images. Our CNN applies successively 3 layers of filters, and extracts 64 feature channels. The filter kernels are weights, i.e. they are optimized jointly with the classifier parameters. -A larger convolutional model, which we will call Time Series Convolutional Network (TSCN). The performance of convolutional neural networks often depends on their number of weights, i.e. their size. To compare the capabilities of both a small and a larger model, we include this network with 8 layers of filters and 128 feature channels. -A bidirectional Long Short Term Memory (LSTM) network [22]. This model accumulates the information from a sequence of inputs into two internal states, one optimized to recognize long term dependencies, and one to recognize short term dependencies. It is widely used for sequential data, e.g. for natural language processing and time series. We use two bidirectional layers, and internal states with the dimensionality 200. -A for Time Series classification optimized Transformer (TST) [23]. Transformers are attention-based models, that recently have shown great performance on multiple types of data, among them time series applications [24]. They are very efficient in recognizing dependencies between different parts of sequences and the new state-of-the-art large language models [25]. We use the implementation of Ref. [26].
The specifications and the hyperparameters of the models used in this work are given in in App. A. All implementations are built in PyTorch [27]. The weights of neural networks are iteratively optimized (learned) by minimizing an error (loss) function between its predictions and the labels. This process is called the 'training' of the model. We minimize the binary cross entropy loss function, a quantification of the prediction error commonly used for classifier training ([21], p. 49). We weight the loss resulting from positive and negative records individually to counteract the preponderance of positive records. This way, both groups have the same impact in the optimization process. A sweep through the whole training data set, i.e. one iteration through all mini-batches, is called an epoch. We train all models for 15 epochs. We randomize the mini-batches with a technique called weak shuffling: individual mini-batches consist of records from the same detector, while the order in which mini-batches are used in the training process is randomized. This technique significantly reduces the data loading time during training compared to randomization within each mini-batch, as the records are stored consecutively on our hard drive. The optimization process was done on a Tesla P100 GPU and with the ADAptive Moment estimation (ADAM) optimizer [28], a variant of stochastic gradient descent. The optimizer does one update of the weights We optimized the learning rates for all models individually with the cyclic learning rate finder technique, described in Ref. [29]. The learning rates used for each model are listed in Table 7.
After each training epoch, we evaluate the loss value on our validation set. The loss values on the training and validation set for all models throughout the training process are shown in Fig. 3. All curves resemble the typical shape of neural network training curves, and in three out of four models the loss on the validation set continues to decrease proportionally to the loss on the training set. Only for the TST model, the validation loss rises slightly, which indicates overfitting on the training data. We experimented with the models hyperparameters to improve the overfitting, but other configurations lead to overall worse results on the validation set. We saved the trained model after each epoch, and can therefore use those which had the best agreement between labels and predictions (accuracy) on the validation set. The TSCN model shows the lowest loss value on the validation set after the training process. We do not apply the data augmentations (see Sect. 2) on the validation data, therefore, the validation loss is generally lower than the training loss.

Results
We run multiple tests with our trained models: we evaluate the balanced accuracy, recall and precision on the test set, the recall (selectivity) on simulated particle recoil (pile-up) events and we visualize the data manifold before and after application of our model with a PCA. Additionally, we compare a PH spectrum cleaned with classical cuts to one cleaned with our models. The metrics used in this section are defined as follows: where T are positive records, N are negative records, TP are true positive predictions and TN are true negative predictions. R(μ) and S(μ) are the recall and selectivity as functions of PS features. Ω is the region over which we integrate the recall or selectivity and μ is a placeholder for one or multiple PS features (e.g. the PH) w.r.t. which we perform the integration.

Evaluation on the test set
We apply all models to the test set. To use the models as classifiers, a cutoff on the output value (the belief) has to be defined, below which we reject the record. We evaluate the balanced accuracy, the precision and the recall w.r.t. the cutoff value. They are visualized in Fig. 4. For specific applications, the cutoff value can be tuned to fulfill specific demands to recall or precision. In this work, we choose a generic cutoff value of 0.5 for all classifiers.
The recall and precision are naturally in a opposing relationship, explaining the typical kink in the curve shape. The LSTM and TSCN models give the best performance across all metrics. The metrics for the cutoff value 0.5 are reported in Table 2. The predictions on the test set were done in batches as well, with a batch size of 32. The total run times for the predictions on the test set are reported in Table 2. There are typically several thousand predictions done per second, which is due to the low computational cost for inference with neural networks, and the strong parallelization of the necessary matrix multiplications on the GPU.  Table 2 The metrics of the trained models, evaluated on the test set and simulated data. (col. 2-4) The balanced accuracy, recall and precision on the test set with a cutoff value of 0.5. (col. 5, 6) The IOR score for the simulated positive particle recoils, and the IOS score for the simulated negative pile-up events. The values are defined in the text. (col. 7) The runtime for predicting the whole test set. In brackets is the runtime divided by the lowest runtime The score of the best model in each metric is marked in bold We take a closer look to the wrongly predicted records. For this, we use the predictions of the LSTM network, and pick one of the detectors from the test set. The detector has 70 records that were wrongly predicted, out of 8422 records in total. We show the first 64 of the wrongly predicted records in Fig. 5. After visual inspection, we discover that 39 of them are wrongly labeled, mostly good pulses that were collaterally rejected by the imperfect quality cuts on the main parameters. Among the wrongly labeled records there are also noise traces for which the decision if they should survive quality cuts or not depends on the distribution of noise traces in the individual detector. In a full analysis of a detector, these events would most likely not surpass the trigger threshold and are therefore irrelevant.

Evaluation on simulated data
The evaluation on data labeled by cuts has the drawback of wrong labels, we therefore do a second evaluation on simulated data. For this, we superpose the typical pulse shape of one detector from our test set onto empty noise traces that were generated according to Ref. [30]. We simulate a data set of 50000 positive particle recoil records and negative pile-up records each, with PHs between zero and three hundred SNR, i.e. the ratio of the PH of the event and the BL noise resolution. For the pile-up events the onset of the first pulse is placed to 1/4 of the window, the second is placed randomly within the window. We evaluate the recall on the positive records w.r.t. the SNR. The result is visualized in Fig. 6 (left). We calculate the IOR for the SNR-range from three to three hundred. Furthermore, we evaluate the selectivity on the pile-up events (Fig. 6, right), w.r.t the difference in onset and relative difference in PH. We calculate the IOS over the whole plotted range. Both IOR and IOS values are reported for all models in Table 2. The LSTM and TSCN score best and equally good in IOS, while the TST scores best in IOR. Later result is however due to the fact, that the TST defaults to high output values, which is also visible in Fig. 4 (left). The convolutional models CNN and TSCN feature a significantly lower runtime as the LSTM and TST.

Pulse height spectrum and data manifold
As a last evaluation of our model, we choose a 3-channel detector module from the test set. For this detector, the target is a cylindrical calcium tungstate crystal. The first channel of the module is a TES placed on the target, collecting the thermal and athermal phonons produced in particle recoils. The second channel is a TES placed on a shielding ring structure around the first TES, which is directly connected to both the target and the light detector. The third channel is a TES on the light detector, a beaker-shaped object that fully surrounds the target. Fig. 5 A batch of events from the test set that were wrongly predicted by the LSTM. The grey color indicates wrong labels. Some records, among them the tilted BLs, can hardly be flagged as positive or negative without additional context, namely the distribution of the remaining data of the corresponding detector We apply our LSTM model to all channels of the detector module, and call the events clean that correspond to positive prediction for all three channels. In a technique called manifold visualization we want to visualize the distribution of the data in the high dimensional vector space of their sample values. To include the information of all three channels, we concatenate the time series of the records that correspond to the same event in different channels to vectors of length 1536, and form by that a combined data matrix. We do this for all data, and only the cleaned data, for which we use a PCA. The PCA is a singular value decomposition, i.e. the calculation of eigenvectors and eigenvalues, of the covariance matrix of the combined data matrix. The first (second) eigenvector of this matrix, called the first (second) principal component (PC), is itself a time series of length 1536 that corresponds to the template that accounts for the highest (second highest) variance of values in the concatenated records. For a more detailed description of the PCA method, we refer to Ref. [16], where it was first applied for pulse shape identification in cryogenic calorimeters. We plot the projection of the full raw data to its first and second PCs in Fig. 7 (left) and the same for only the cleaned data in Fig. 7 (right). The projection is calculated by building the dot product of the concatenated records and the PC. This corresponds to a change of basis in the high dimensional vector space of the sample values, with the first and second PCs as the new basis vectors. While the variance in the raw data is dominated by steps in the record windows induced by flux quantum losses of the SQUID amplifier (example in Fig. 1, orange), the cleaned data manifold resembles the actual particle recoil types present in the data: target hits, hits in the ring veto detector, and direct hits of the light detector. For the target and veto detector hits electron and nuclear recoils are also separately visible in the plot, due to the different amplitude in the light channel. The identification of the lines of individual event types in Fig. 7 (right) was done by comparing the equivalent plot for e − /γ and neutron calibration, and visual inspection of events and principal components individually. The first (second) principal component strongly resembles the PS of phonon-only (light-only) events. Thus, the line structure we observe in Fig. 7 (right) is in agreement with our expectation. Direct light detector hits, with no phonon signal, cluster along the second PC. Target hits cluster mostly along the first PC, but they differ in the amount of produced scintillation light. Electron hits produce significantly more scintillation light, and ring hits produce an additional signal in the light detector due to a weak thermal link between them.
Additionally, we show the PH spectrum of the target channel of the same detector module without any cuts, with the analysis cuts that we used as labels, and cutted with the LSTM model, in Fig. 8. The PH is for low energy particle recoils proportional to the recoil energy. We observe a strong agreement of the LSTM model and the analysis cuts over the full width of the spectrum. The uncleaned data is strongly polluted with artifacts.
To summarize, we have shown in Sect. 4, that we reliably discriminate events originating from particle recoils and artefacts. Differently to standard approaches, we do not rely on prior knowledge of individual detectors, as its characteristic pulse shape, or manual interventions, as finding individual cut values. We have shown in Sect. 4.3, that our trained models reproduce the objectives of the cuts that were made by analysts. The event selection, and with that the physics reach of the experiment, does therefore not change significantly whether our model is applied or the selection is done per hand. This is the outcome we had hoped for and our results can have the following impacts: Fig. 6 Metrics of the classifier models, evaluated on simulated data. (left) The recall values w.r.t. the SNR of simulated events. The recall drops towards lower values, but is still reasonably high around a typical trigger threshold value of 5 BL noise resolutions (grey, dashed). The reason for the local minimum of the CNN curve above 10 SNR is not cogently clearified. The most likely hypothesis is the absence of many low energy pulses in the training set, which can introduce a bias in models predictions. The simultaneous dip in the recall of multiple models around 80 SNR is a small sample effect of the simulation: it could be connected to two simulated events with similar energy, with relatively strongly tilted BLs. (right) The selectivity values for the LSTM model on simulate pile-up events, featuring two pulses, w.r.t. the difference in onset and relative difference in PH. Only pile-up events with large relative PH difference or very small onset difference are not rejected by the model. The area that is covered by the inset holds only selectivity values of one. (right, inset) An example of a simulated pile-up event Fig. 7 The data manifold visualized with the first first two principal components. (left) The raw data, without cleaning (black) and the cleaned data (orange), both projected to the first and second principal components of the raw data matrix. (right) The cleaned data projected to the first and second principal components of the cleaned data matrix. The lines corresponding to the individual event types are clearly visible. The PH spectrum of the target channel is shown in Fig. 8 1. For large-scale detector setups the fine tuning of dedicated cuts for each detector can produce non-negligible delays in the analysis or even become infeasible. The application of our models instead can produce equivalent cuts instantly. 2. Recorded data can be monitored in real time. This can uncover unwanted shifts of the measurement setup, as e.g. an increased rate of artifacts, immediately and enable fast interventions. Furthermore, features in the event distribution, e.g. a peak in the PH spectrum, can be identified as particle-like or artefact-like, without the need for designing cuts first.

Conclusion and outlook
We trained a selection of deep learning models to perform the data cleaning task for cryogenic detectors. We observed equally promising performance from a convolutional neural network, TSCN, and an LSTM neural network. The best achieved balanced accuracy score on the test set is 0.932, with a recall of 0.986 and precision of 0.985. Notably, the majority of the wrongly predicted records by the best performing model are either wrongly labeled records or hard-to-decide records. The method therefore apparently outperforms the cuts which we performed on the data, in the task of discriminating pulse-shaped from other records. For practical applications, our method has another advantage: the runtime is with O(1k-10k) predictions per second significantly faster than standard fit methods that are a typical alternative for the same purpose. Note, that our method can be applied to a new detector blindly, i.e. without relying on any information from the detector itself, as typical PSs. We therefore reduce the necessary manual task in data cleaning from handcrafting cut values for each detector individually, to merely controlling the predictions of the model. Fig. 8 The PH spectrum of an exemplary detector without cleaning (black), with the cut analysis that we used as labels (blue) and the LSTM predictions (LSTM). The blue and orange curves almost fully overlap due to the strong agreement between cuts and LSTM. The data manifold of the corresponding 3-channel detector module is visualized in Fig. 7 In future work, our method can be extended with an appropriate, unsupervised clustering of the different PSs. A promising approach was presented in Ref. [15]. Furthermore, an investigation of the poor Transformer performance could be started. Likely the extraction of proper representations of the time series, as done for raw audio data in the popular wave2vec 2.0 framework [31], would improve its performance. Fully connected 12800 input nodes, 1 output node, sigmoid activation function